Estimating Join Selectivities using Bandwidth-Optimized Kernel Density Models

Authors

  • Martin Kiefer
  • Max Heimel
  • Sebastian Breß
  • Volker Markl
Abstract

Accurately predicting the cardinality of intermediate plan operations is an essential part of any modern relational query optimizer. The accuracy of said estimates has a strong and direct impact on the quality of the generated plans, and incorrect estimates can have a negative impact on query performance. One of the biggest challenges in this field is to predict the result size of join operations. Kernel Density Estimation (KDE) is a statistical method to estimate multivariate probability distributions from a data sample. Previously, we introduced a modern, self-tuning selectivity estimator for range scans based on KDE that outperforms state-of-the-art multidimensional histograms and is efficient to evaluate on graphics cards. In this paper, we extend these bandwidth-optimized KDE models to estimate the result size of single and multiple joins. In particular, we propose two approaches: (1) Building a KDE model from a sample drawn from the join result. (2) Efficiently combining the information from base table KDE models. We evaluated our KDE-based join estimators on a variety of synthetic and real-world datasets, demonstrating that they are superior to state-of-the-art join estimators based on sketching or sampling.
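To make the idea behind approach (1) concrete, here is a minimal sketch of how a KDE model built over a sample of a join result could estimate the selectivity of a range predicate. It is not the authors' implementation: the product Gaussian kernel, the fixed bandwidths, the sample values, and the names `gaussian_cdf`, `kde_range_selectivity`, and `join_sample` are all assumptions chosen for illustration. In particular, the paper optimizes the bandwidths, whereas this sketch simply fixes them by hand.

```python
import math


def gaussian_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def kde_range_selectivity(sample, bandwidths, lower, upper):
    """Estimate the fraction of tuples inside the box [lower, upper].

    Each sample point contributes the probability mass that its
    product Gaussian kernel places inside the query box.
    """
    total = 0.0
    for point in sample:
        mass = 1.0
        for x, h, lo, hi in zip(point, bandwidths, lower, upper):
            mass *= gaussian_cdf((hi - x) / h) - gaussian_cdf((lo - x) / h)
        total += mass
    return total / len(sample)


# Hypothetical sample drawn from a join result: one attribute from each side.
join_sample = [(1.0, 10.0), (2.0, 12.0), (2.5, 11.0), (8.0, 30.0)]

# Assumed, hand-picked bandwidths (one per dimension).
bandwidths = (0.5, 1.0)

# Range predicate: first attribute in [0, 5], second attribute in [5, 20].
selectivity = kde_range_selectivity(
    join_sample, bandwidths, lower=(0.0, 5.0), upper=(5.0, 20.0)
)
```

Multiplying `selectivity` by the size of the join result (known or separately estimated) would give a cardinality estimate for the predicate. The bandwidths control how far each sample point's mass spreads, which is why choosing them well matters for accuracy.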


Similar resources

Bayesian Bandwidth Selection in Nonparametric Time-Varying Coefficient Models

Bandwidth plays an important role in determining the performance of local linear estimators. In this paper, we propose a Bayesian approach to bandwidth selection for local linear estimation of time–varying coefficient time series models, where the errors are assumed to follow the Gaussian kernel error density. A Markov chain Monte Carlo algorithm is presented to simultaneously estimate the band...


A bandwidth selection for kernel density estimation of functions of random variables

In this investigation, the problem of estimating the probability density function of a function of m independent identically distributed random variables, g(X1,X2, ...,Xm) is considered. The choice of the bandwidth in the kernel density estimation is very important. Several approaches are known for the choice of bandwidth in the kernel smoothing methods for the case m = 1 and g is the identity....


A Bayesian approach to parameter estimation for kernel density estimation via transformations

In this paper, we present a Markov chain Monte Carlo (MCMC) simulation algorithm for estimating parameters in the kernel density estimation of bivariate insurance claim data via transformations. Our data set consists of two types of auto insurance claim costs and exhibits a high level of skewness in the marginal empirical distributions. Therefore, the kernel density estimator based on original ...


Boundary Adjusted Density Estimation and Bandwidth Selection

This paper studies boundary effects of the kernel density estimation and proposes some remedies to the problems. Since the kernel estimate is designed for estimating a smooth density, it introduces a large bias near the boundaries where the density is discontinuous. Bandwidth selectors developed for the kernel estimate that select a small bandwidth to reduce the bias can dramatically increase t...


On the Equivalence of Exact and Asymptotically Optimal Bandwidths for Kernel Estimation of Density Functionals

Given a density $f$, we pose the problem of estimating the density functional $\psi_r = \int f^{(r)} f$ making use of kernel methods. This is a well-known problem but some of its features remained unexplored. We focus on the problem of bandwidth selection. Whereas all the previous studies concentrate on an asymptotically optimal bandwidth, here we study the properties of exact, nonasymptotic ones, and relate t...



Journal:
  • PVLDB

Volume 10  Issue 

Pages  -

Publication date: 2017